The project aims to analyze the different features of the Google Playstore apps and explore any correlation between thee features - Free/Paid Apps, Size of the apps, Reviews and Rating from users, Ad supported apps, content rating, category of the apps, etc.
Some of the questions attempted to answer include:
import pandas as pd
import numpy as np
!pip install CurrencyConverter
from currency_converter import CurrencyConverter
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.express as px
import plotly
Requirement already satisfied: CurrencyConverter in c:\users\simi\anaconda3\lib\site-packages (0.16.4)
apps_df=pd.read_csv("google-playstore-apps/Google-Playstore.csv")
apps_df.shape
(2312944, 24)
apps_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2312944 entries, 0 to 2312943 Data columns (total 24 columns): # Column Dtype --- ------ ----- 0 App Name object 1 App Id object 2 Category object 3 Rating float64 4 Rating Count float64 5 Installs object 6 Minimum Installs float64 7 Maximum Installs int64 8 Free bool 9 Price float64 10 Currency object 11 Size object 12 Minimum Android object 13 Developer Id object 14 Developer Website object 15 Developer Email object 16 Released object 17 Last Updated object 18 Content Rating object 19 Privacy Policy object 20 Ad Supported bool 21 In App Purchases bool 22 Editors Choice bool 23 Scraped Time object dtypes: bool(4), float64(4), int64(1), object(15) memory usage: 361.8+ MB
apps_df.describe()
| Rating | Rating Count | Minimum Installs | Maximum Installs | Price | |
|---|---|---|---|---|---|
| count | 2.290061e+06 | 2.290061e+06 | 2.312837e+06 | 2.312944e+06 | 2.312944e+06 |
| mean | 2.203152e+00 | 2.864839e+03 | 1.834452e+05 | 3.202017e+05 | 1.034992e-01 |
| std | 2.106223e+00 | 2.121626e+05 | 1.513144e+07 | 2.355495e+07 | 2.633127e+00 |
| min | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 0.000000e+00 | 0.000000e+00 | 5.000000e+01 | 8.400000e+01 | 0.000000e+00 |
| 50% | 2.900000e+00 | 6.000000e+00 | 5.000000e+02 | 6.950000e+02 | 0.000000e+00 |
| 75% | 4.300000e+00 | 4.200000e+01 | 5.000000e+03 | 7.354000e+03 | 0.000000e+00 |
| max | 5.000000e+00 | 1.385576e+08 | 1.000000e+10 | 1.205763e+10 | 4.000000e+02 |
apps_df.head()
| App Name | App Id | Category | Rating | Rating Count | Installs | Minimum Installs | Maximum Installs | Free | Price | ... | Developer Website | Developer Email | Released | Last Updated | Content Rating | Privacy Policy | Ad Supported | In App Purchases | Editors Choice | Scraped Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gakondo | com.ishakwe.gakondo | Adventure | 0.0 | 0.0 | 10+ | 10.0 | 15 | True | 0.0 | ... | https://beniyizibyose.tk/#/ | jean21101999@gmail.com | Feb 26, 2020 | Feb 26, 2020 | Everyone | https://beniyizibyose.tk/projects/ | False | False | False | 2021-06-15 20:19:35 |
| 1 | Ampere Battery Info | com.webserveis.batteryinfo | Tools | 4.4 | 64.0 | 5,000+ | 5000.0 | 7662 | True | 0.0 | ... | https://webserveis.netlify.app/ | webserveis@gmail.com | May 21, 2020 | May 06, 2021 | Everyone | https://dev4phones.wordpress.com/licencia-de-uso/ | True | False | False | 2021-06-15 20:19:35 |
| 2 | Vibook | com.doantiepvien.crm | Productivity | 0.0 | 0.0 | 50+ | 50.0 | 58 | True | 0.0 | ... | NaN | vnacrewit@gmail.com | Aug 9, 2019 | Aug 19, 2019 | Everyone | https://www.vietnamairlines.com/vn/en/terms-an... | False | False | False | 2021-06-15 20:19:35 |
| 3 | Smart City Trichy Public Service Vehicles 17UC... | cst.stJoseph.ug17ucs548 | Communication | 5.0 | 5.0 | 10+ | 10.0 | 19 | True | 0.0 | ... | http://www.climatesmarttech.com/ | climatesmarttech2@gmail.com | Sep 10, 2018 | Oct 13, 2018 | Everyone | NaN | True | False | False | 2021-06-15 20:19:35 |
| 4 | GROW.me | com.horodyski.grower | Tools | 0.0 | 0.0 | 100+ | 100.0 | 478 | True | 0.0 | ... | http://www.horodyski.com.pl | rmilekhorodyski@gmail.com | Feb 21, 2020 | Nov 12, 2018 | Everyone | http://www.horodyski.com.pl | False | False | False | 2021-06-15 20:19:35 |
5 rows × 24 columns
#Make a copy of the dataset
df_backup=apps_df.copy()
# Check for duplicate rows
apps_df.sort_values('App Name', ascending=False,inplace=True)
apps_df.duplicated().sum()
0
#Drop irrelevant columns
apps_df.drop(columns=['App Id','Minimum Installs', 'Maximum Installs', 'Minimum Android', 'Developer Id', 'Developer Website', \
'Developer Email', 'Scraped Time', 'Privacy Policy'], inplace=True)
apps_df.columns
Index(['App Name', 'Category', 'Rating', 'Rating Count', 'Installs', 'Free',
'Price', 'Currency', 'Size', 'Released', 'Last Updated',
'Content Rating', 'Ad Supported', 'In App Purchases', 'Editors Choice'],
dtype='object')
#Rename column headings/ Replace unwanted spaces in column headings
apps_df.rename(columns={'App Name': 'App_Name','Rating Count':'Reviews','Last Updated': 'Last_Updated',
'Content Rating': 'Content_Rating', 'Ad Supported': 'Ad_Supported', 'In App Purchases': 'In_App_Purchases',\
'Editors Choice': 'Editors_Choice'},inplace=True)
apps_df.columns
Index(['App_Name', 'Category', 'Rating', 'Reviews', 'Installs', 'Free',
'Price', 'Currency', 'Size', 'Released', 'Last_Updated',
'Content_Rating', 'Ad_Supported', 'In_App_Purchases', 'Editors_Choice'],
dtype='object')
#Check the data types for all columns
apps_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2312944 entries, 1828752 to 2117918 Data columns (total 15 columns): # Column Dtype --- ------ ----- 0 App_Name object 1 Category object 2 Rating float64 3 Reviews float64 4 Installs object 5 Free bool 6 Price float64 7 Currency object 8 Size object 9 Released object 10 Last_Updated object 11 Content_Rating object 12 Ad_Supported bool 13 In_App_Purchases bool 14 Editors_Choice bool dtypes: bool(4), float64(3), object(8) memory usage: 220.6+ MB
#Check number of null values in each column
apps_df.isnull().sum()
App_Name 2 Category 0 Rating 22883 Reviews 22883 Installs 107 Free 0 Price 0 Currency 135 Size 196 Released 71053 Last_Updated 0 Content_Rating 0 Ad_Supported 0 In_App_Purchases 0 Editors_Choice 0 dtype: int64
#Drop rows where app_name is NULL since there are only 2 App_Names with NULL values
apps_df.drop(apps_df[apps_df.App_Name.isnull()].index,inplace=True)
#Replace missing values in Rating and Reviews with median values as the NULL percentage is very high for these columns to drop
apps_df.Rating=apps_df.Rating.fillna(apps_df.Rating.median())
apps_df.Reviews=apps_df.Reviews.fillna(apps_df.Reviews.median())
apps_df.isnull().sum()
App_Name 0 Category 0 Rating 0 Reviews 0 Installs 107 Free 0 Price 0 Currency 135 Size 196 Released 71053 Last_Updated 0 Content_Rating 0 Ad_Supported 0 In_App_Purchases 0 Editors_Choice 0 dtype: int64
#Replace missing values in Released column with Last Updated dates
apps_df['Released']=np.where(apps_df['Released'].isnull(), apps_df['Last_Updated'],apps_df['Released'])
#Replace missing values in Currency with mode value for currency
apps_df.Currency=np.where(apps_df.Currency.isnull(), apps_df['Currency'].mode,apps_df['Currency'])
#Replace missing values in size column with 1000 and standardize all column values to Kb
def impute_size(size):
size=str(size).replace(',','')
value=size[:-1]
if size[-1]=='K' or size[-1]=='k' :
return float(value)
elif size[-1]=='M' or size[-1]=='m':
return float(value)*1000
elif size[-1]=='G' or size[-1]=='g':
return float(value)*1000000
else:
return 1000
apps_df['Size']=apps_df['Size'].apply(impute_size)
##Format Installs - Strip '+' from Installs and remove all commas
apps_df['Installs'] = apps_df['Installs'].apply(lambda x : str(x).replace('+', '').replace(',', ''))
apps_df['Installs']=apps_df['Installs'].apply(lambda x : str(x).replace('nan','0'))
#Convert datatypes - Installs, Released and Last_Updated columns
apps_df['Installs']=pd.to_numeric(apps_df['Installs'],errors='coerce')
apps_df['Released']=pd.to_datetime(apps_df['Released'],errors='coerce')
apps_df['Last_Updated']=pd.to_datetime(apps_df['Last_Updated'],errors='coerce')
apps_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2312942 entries, 1828752 to 1965443 Data columns (total 15 columns): # Column Dtype --- ------ ----- 0 App_Name object 1 Category object 2 Rating float64 3 Reviews float64 4 Installs int64 5 Free bool 6 Price float64 7 Currency object 8 Size float64 9 Released datetime64[ns] 10 Last_Updated datetime64[ns] 11 Content_Rating object 12 Ad_Supported bool 13 In_App_Purchases bool 14 Editors_Choice bool dtypes: bool(4), datetime64[ns](2), float64(4), int64(1), object(4) memory usage: 220.6+ MB
apps_df.isnull().sum()
App_Name 0 Category 0 Rating 0 Reviews 0 Installs 0 Free 0 Price 0 Currency 0 Size 0 Released 0 Last_Updated 0 Content_Rating 0 Ad_Supported 0 In_App_Purchases 0 Editors_Choice 0 dtype: int64
apps_df.Category.value_counts()
Education 241090 Music & Audio 154906 Tools 143988 Business 143771 Entertainment 138276 Lifestyle 118331 Books & Reference 116728 Personalization 89210 Health & Fitness 83510 Productivity 79698 Shopping 75256 Food & Drink 73927 Travel & Local 67288 Finance 65466 Arcade 53791 Puzzle 51168 Casual 50813 Communication 48167 Sports 47483 Social 44734 News & Magazines 42806 Photography 35552 Medical 32065 Action 27555 Maps & Navigation 26722 Simulation 23282 Adventure 23203 Educational 21308 Art & Design 18539 Auto & Vehicles 18280 House & Home 14369 Video Players & Editors 14015 Events 12841 Trivia 11795 Beauty 11772 Board 10588 Racing 10362 Role Playing 10034 Word 8630 Strategy 8526 Card 8179 Weather 7246 Dating 6524 Libraries & Demo 5198 Casino 5076 Music 4202 Parenting 3810 Comics 2862 Name: Category, dtype: int64
# Combine 'Music' and 'Music & Audio' into 'Music'
apps_df['Category'] = apps_df['Category'].str.replace('Music & Audio', 'Music')
# Combine 'Educational' and 'Education' into 'Education'
apps_df['Category'] = apps_df['Category'].str.replace('Educational', 'Education')
apps_df['Category'].value_counts()
Education 262398 Music 159108 Tools 143988 Business 143771 Entertainment 138276 Lifestyle 118331 Books & Reference 116728 Personalization 89210 Health & Fitness 83510 Productivity 79698 Shopping 75256 Food & Drink 73927 Travel & Local 67288 Finance 65466 Arcade 53791 Puzzle 51168 Casual 50813 Communication 48167 Sports 47483 Social 44734 News & Magazines 42806 Photography 35552 Medical 32065 Action 27555 Maps & Navigation 26722 Simulation 23282 Adventure 23203 Art & Design 18539 Auto & Vehicles 18280 House & Home 14369 Video Players & Editors 14015 Events 12841 Trivia 11795 Beauty 11772 Board 10588 Racing 10362 Role Playing 10034 Word 8630 Strategy 8526 Card 8179 Weather 7246 Dating 6524 Libraries & Demo 5198 Casino 5076 Parenting 3810 Comics 2862 Name: Category, dtype: int64
#Convert price to USD and drop the Currency column
cc=CurrencyConverter()
def currency_conv(currency):
if currency not in cc.currencies:
currency=1
else:
currency=cc.convert(1,currency,'USD')
return currency
apps_df['Currency']=apps_df['Currency'].apply(currency_conv)
apps_df.drop(columns=['Currency'],inplace=True)
#Heatmap to visualize the correlation between features
plt.figure(figsize=(15,10))
p = sns.heatmap(apps_df.corr(), annot=True, cmap="YlGnBu")
plt.savefig('Plots/Corr.png')
#Plotting Free Vs Paid Apps, In App Purchases, Add Supported and Editors Choice Apps
figPie, axs = plt.subplots(2, 2, figsize=(10, 10))
axs[0, 0].pie(apps_df.Free.value_counts(), labels=['Free','Paid'], autopct='%1.1f%%', \
explode=(0,0.1),colors=['lightblue','grey'])
axs[0, 0].set_title('Free Vs Paid')
axs[0, 1].pie(apps_df.In_App_Purchases.value_counts(), labels=['No','Yes'], autopct='%1.1f%%',\
explode=(0,0.1),colors=['lightblue','grey'])
axs[0, 1].set_title('In App Purchases')
axs[1, 0].pie(apps_df.Ad_Supported.value_counts(), labels=['Yes','No'], autopct='%1.1f%%', \
explode=(0,0.1),colors=['lightblue','grey'])
axs[1, 0].set_title('Ad Supported')
axs[1, 1].pie(apps_df.Editors_Choice.value_counts(), labels=['No','Yes'], autopct='%1.1f%%', \
explode=(0,0.1),colors=['lightblue','grey'])
axs[1, 1].set_title('Editors Choice')
figPie.savefig('Plots/Pies.png')
plt.show()
#Plotting App Ratings
plt.figure(figsize=(10,5))
sns.histplot(x='Rating',data=apps_df[apps_df.Rating>0],kde=True)
plt.savefig('Plots/RatingCount.png')
#Plotting relationship between number of Reviews and Ratings
plt.figure(figsize=(10,5))
sns.scatterplot(x='Reviews',y='Rating',data=apps_df[apps_df.Reviews>0])
plt.savefig('Plots/ReviewsRatings.png')
plt.show()
Insight: Comparatively low average rating for an app as per Reviews Vs Rating plotting, needs further exploring
#Details of Apps which has more than 10 million reviews but average rating less than 4, which indicates improvement required.
#FB has the lowest rating in the group
filtered_df=apps_df[(apps_df.Reviews>10000000) & (apps_df.Rating<4)].loc[:,['App_Name','Category','Reviews','Rating']].sort_values('Rating').head(10).reset_index()
filtered_df[['App_Name','Rating']].plot(kind='bar', x='App_Name', color='lightblue')
plt.legend=None
plt.title="Ratings for Apps with more than 10 million Reviews"
plt.savefig('Plots/PopularApps.png')
plt.show()
#Plotting Installs against Price, Reviews and Rating
fig, axes = plt.subplots(1, 4, figsize=(15, 5), sharey=True)
plt.figure(figsize=(10,5))
sns.scatterplot(ax=axes[0],x='Price',y='Installs',data=apps_df)
plt.figure(figsize=(10,5))
sns.scatterplot(ax=axes[1],x='Reviews',y='Installs',data=apps_df)
plt.figure(figsize=(10,5))
sns.scatterplot(ax=axes[2],x='Rating',y='Installs',data=apps_df[apps_df.Rating>0])
plt.figure(figsize=(10,5))
sns.scatterplot(ax=axes[3],x='Size',y='Installs',data=apps_df[apps_df.Rating>0])
fig.savefig('Plots/Installs1.png')
plt.show()
<Figure size 720x360 with 0 Axes>
<Figure size 720x360 with 0 Axes>
<Figure size 720x360 with 0 Axes>
<Figure size 720x360 with 0 Axes>
#Apps Released 2010-2021
fig1=plt.figure(figsize=(20,5))
apps_df['Year_Released'] = apps_df.Released.dt.year
filtered_df=apps_df.groupby(['Year_Released']).agg(Number_of_Apps=('App_Name', 'count')).reset_index()
ax1 = fig1.add_subplot()
ax1.bar(filtered_df['Year_Released'], filtered_df['Number_of_Apps'], color="grey")
fig1.savefig('Plots/Apps.png')
plt.show()
#Top 10 categories by Number of Apps
fig2 = plt.figure(figsize=(12, 8))
top_10_appNums = apps_df.groupby(['Category']).agg(Total_Apps=('App_Name','count')) \
.sort_values('Total_Apps',ascending=False).head(10).reset_index()
ax2 = fig2.add_subplot(111)
ax2.bar(top_10_appNums['Category'], top_10_appNums['Total_Apps'], color="lightblue")
plt.xticks(rotation=90)
fig2.savefig('Plots/topCategories.png')
plt.show()
#Distribution of Ratings Across the top 10 categories by Number of Apps
plt.figure(figsize=(15,8))
sns.boxplot(x='Category',y='Rating',data=apps_df[(apps_df.Category.isin(top_10_appNums.Category)) & (apps_df.Rating>0)])
plt.xticks(rotation=90)
plt.savefig('Plots/Top10Rating.png')
plt.show()
#Bottom 10 categories by Number of Apps
fig = plt.figure(figsize=(20, 8))
bottom_10_appNums = apps_df.groupby(['Category']).agg(Total_Apps=('App_Name','count')) \
.sort_values('Total_Apps',ascending=False).tail(10).reset_index()
ax = fig.add_subplot(111)
ax.bar(bottom_10_appNums['Category'], bottom_10_appNums['Total_Apps'], color="salmon")
fig.savefig('Plots/Bottom10.png')
plt.show()
#Distribution of Ratings Across the bottom 10 categories by Number of Apps
plt.figure(figsize=(15,8))
sns.boxplot(x='Category',y='Rating',data=apps_df[(apps_df.Category.isin(bottom_10_appNums.Category)) & (apps_df.Rating>0)])
plt.xticks(rotation=90)
plt.savefig('Plots/Bottom10Rating.png')
plt.show()
#Top 10 categories by Installs
fig = plt.figure(figsize=(20, 8))
top_10_Installs = apps_df.groupby(['Category']).agg(Total_Installs=('Installs',sum)) \
.sort_values('Total_Installs',ascending=False).head(10).reset_index()
ax = fig.add_subplot(111)
ax.bar(top_10_Installs['Category'], top_10_Installs['Total_Installs'], color="lightblue")
fig.savefig('Plots/Top10Installs.png')
plt.show()
#Distribution of Ratings Across the top 10 categories by Installs
plt.figure(figsize=(15,8))
sns.boxplot(x='Category',y='Rating',data=apps_df[(apps_df.Category.isin(top_10_Installs.Category)) & (apps_df.Rating>0)])
plt.xticks(rotation=90)
plt.savefig('Plots/Top10InstallsRating.png')
plt.show()
#Bottom 10 categories by Installs
fig = plt.figure(figsize=(20, 8))
bottom_10_Installs = apps_df.groupby(['Category']).agg(Total_Installs=('Installs',sum)) \
.sort_values('Total_Installs',ascending=False).tail(10).reset_index()
ax = fig.add_subplot(111)
ax.bar(bottom_10_Installs['Category'], bottom_10_Installs['Total_Installs'], color="salmon")
fig.savefig('Plots/Bottom10Installs.png')
plt.show()
#Distribution of Ratings Across the bottom 10 categories by Installs
plt.figure(figsize=(15,8))
sns.boxplot(x='Category',y='Rating',data=apps_df[(apps_df.Category.isin(bottom_10_Installs.Category)) & (apps_df.Rating>0)])
plt.xticks(rotation=90)
plt.savefig('Plots/Bottom10InstallsRating.png')
plt.show()
Insight: Irrespective of the number of installs, both top and bottom 10 categories tend to have 75% ratings less than 4
#Number of apps by Content Rating
temp=apps_df.groupby('Content_Rating').agg(Apps_Count=('App_Name','count')).reset_index()
temp.sort_values('Apps_Count', ascending=False, inplace=True)
plt.figure(figsize=(10,5))
sns.barplot(x='Content_Rating',y='Apps_Count',data=temp, palette="Set2")
plt.savefig('Plots/ContentRating.png')
plt.show()
Insight: Teen specific apps are less in number compared to General apps
#Installs by Content Rating
temp=apps_df.groupby('Content_Rating').agg(Total_Installs=('Installs',sum)).reset_index()
temp.sort_values('Total_Installs', ascending=False, inplace=True)
plt.figure(figsize=(10,5))
sns.barplot(x='Content_Rating',y='Total_Installs',data=temp)
plt.savefig('Plots/ContentRatingInstalls.png')
plt.show()
Insight: Teen specific apps though much less in number compared to General apps have comparatively good downloads
#Installs by Content Rating/Categories
fig = px.sunburst(apps_df, path=['Content_Rating','Category'], values='Installs', width=800, height=700)
fig.update_layout(title_text='Installs by Content Rating/Categories ', title_font_size=20)
fig.show()
Insight: Social, video players, Entertainment, Actions and Music are the most popular categories in Content Rating group Teen
General Content Rating group has Tools, Communication and Productivity as the most popular categories
#Social apps by Content Rating
apps_df[ apps_df.Category=='Social'].Content_Rating.value_counts()
Teen 23560 Everyone 15570 Mature 17+ 5523 Everyone 10+ 74 Unrated 4 Adults only 18+ 3 Name: Content_Rating, dtype: int64
#Ratings Vs Content Rating
plt.figure(figsize=(15,8))
sns.boxplot(x='Content_Rating',y='Rating',data=apps_df[apps_df.Rating>0])
plt.savefig('Plots/ContentRatingRating.png')
plt.show()
- Steady increase in the number of apps released with highest number of apps released in 2020, 2021 data is incomplete
however might point to a slight decrease
- Most number of apps are rated in the 4-4.5 range
- Popular apps Rating shows much room for improvement
- App Size inversely proportional to number of installs
- Higher Ratings and Number of Installs have significant correlation
- While there are more number of apps in Education category, Tools category is more popular with highest number of installs
- Dating, Parenting etc have lesser number of apps
- No significant difference in mean ratings across apps in the top and bottom categories. 75% of apps across
significant number of top and bottom 10 categories have ratings less than 4
- Most of the apps have content rating 'Everyone'.
- For Content Rating Category 'Everyone', Tools, Communication and Productivity apps have more installs
Social, Video Editors, Entertainment, Action and Music are popular categories in Teen rated apps
File "<ipython-input-42-6c0be6c4d8be>", line 1 - Steady increase in the number of apps released with highest number of apps released in 2020, 2021 data is incomplete ^ SyntaxError: invalid syntax